ABSTRACT


In a document retrieval system, it is a common practice
to organize the documents into clusters and search only a 
few promising clusters. In such an environment it is of 
interest to estimate, with respect to a given query, the 
number of "desired documents" in each cluster. One of the 
reasonable approaches to this problem is to predict the 
distribution of query-document correlations by using the 
cluster representative. The purpose of this thesis is to 
evaluate the accuracy of a number of theoretical 
distributions in predicting the distribution of query- 
document correlations. They are the C-function, the Poisson 
distribution, and the normal distribution. Two modified C- 
functions are also proposed to alleviate some of the 
difficulties inherent in the use of the C-function. Since 
terms are found to be dependent on each other, a 
transformation that is expected to remove the dependencies 
between terms is attempted. The distribution of correlations 


for the newly transformed data is also derived. 
CHAPTER 1


INTRODUCTION 


An information retrieval system consists of a set of
documents in machine-readable form stored to provide
information services to various users. A document or a query
can be represented by an n-dimensional vector whose ith
component indicates the importance of the ith term in the 


document or the query. 


In response to a given query, the system may compute 
the closeness or the correlation between each document and 
the query in terms of a chosen similarity function (for
example, a simple matching function computes the number of
terms in common between the document and the query).
Documents which have correlation greater than a given
threshold are then retrieved. We refer to this set of
documents as the "desired documents". Since the number of
documents in the system is usually very large (for example, 
a stored collection of books and journals in a library), it 
is impractical to examine each document and determine its 
closeness to the query. One way to remedy this situation is 
to divide the data base into clusters and examine only the 
clusters which have high probabilities of containing many 
desired documents. A cluster consists of "similar" documents
and a representative can be constructed for each cluster to 


provide the required probability (see section 2.1.) Besides 
being useful in the selection of promising clusters, an
estimate of the number of desired documents in each cluster 
can also be used for some other applications in information 
retrieval such as the following: 

(1). In order to reduce the amount of computing time, the
system may search only a few related clusters. In such
a situation, the user may be interested in knowing
the percentage loss of desired documents. The
estimate of the number of desired documents will allow
the system to provide this information. If this
percentage is too high, the user may require the system
to search more clusters.

(2). This estimate will also enable the system to give 
the user information about the number of documents 
that are likely to be retrieved. If this number is
too large, the user may decide to add more 
constraints in his query, thereby reducing the 
searching time as well as the number of documents to 


be retrieved. 


With respect to a given query, a distribution of 
correlations of a cluster of documents represents the 
(relative) frequencies of documents at different correlation
values. If the distribution of correlations is known and a 
threshold of retrieval is given, then the number of 
documents which have correlation greater than the threshold 
can easily be obtained. Therefore, it is desirable to obtain 
the distribution of correlations in order to estimate the 


desired documents in each cluster. Several theoretical 
distributions, namely, the C-function [2], the Poisson
distribution [9,10] and the normal distribution [11], have 
been used to describe the distribution of correlations. The 
objective of this thesis is to evaluate the accuracy of 
these theoretical distributions and modify the C-function to 
alleviate some of the difficulties in the use of the C- 


function. 


In Chapter 2, some characterizations of the C-function
are derived and its relationships to the Poisson and the
normal distributions are discussed. Experiments are
performed comparing these theoretical distributions to the
actual distribution of correlations. Experimental results
demonstrate that if the collection is large the C-function
fits the actual distribution rather poorly. This
inaccuracy is largely due to the dependencies between terms
(see section 2.1.1 for an explanation of dependence). Two
modified C-functions that take dependencies of terms into 
consideration are proposed. The accuracy of the theoretical 


distribution is substantially improved. 


In Chapter 3, the principal component analysis [5]
technique is attempted to remove the dependencies between
terms. A theoretical distribution based on the C-function is
developed for this newly transformed set of data (documents
and queries). Similar experiments of fitting the theoretical
distribution to the actual distribution are subsequently
performed. Substantial improvement in accuracy of the 


estimation of the distribution is achieved. 
CHAPTER 2


THE DISTRIBUTION OF QUERY-DOCUMENT CORRELATIONS 


ON BINARY DATA 


It may be desirable to estimate the number of desired
documents in a large collection. The terms "cluster",
"class", and "collection" will be used interchangeably
throughout this thesis. Consider a collection of N documents
represented by the document matrix $M = (d_{ij})$ of n rows and N
columns, where the ith row represents the ith term, the jth
column represents the jth document $D_j$, and $d_{ij} = 1$ if
document $D_j$ has the ith term and $d_{ij} = 0$ otherwise.
Similarly, a query Q is represented by a vector
$(q_1, \ldots, q_n)$, where $q_i = 1$ if the query has the ith
term and $q_i = 0$ otherwise.


A measurement of the closeness between document $D_j$ and
the query Q is given by the simple matching function f:

    $$f(D_j, Q) = \sum_{i=1}^{n} d_{ij} q_i$$

The value of the function represents the number of terms in
common between the document $D_j$ and the query Q. This measure
of closeness or similarity is referred to as the correlation
of document $D_j$ with query Q.


Once the correlations of all the documents with a given
query are calculated, the number of documents having a
correlation equal to i, for i = 1, ..., k (k is the number of
nonzero terms in the query), is obtained. Thus, the
distribution of correlations with respect to that particular
query can be exhibited in a histogram. The correlations from
0 to k are represented on the X-axis while the relative
frequencies are plotted on the Y-axis. This process yields 
the actual distribution of correlations between the 
documents and the given query. However, it is not economical 
to compute the correlations between the documents in the 
cluster and the query. Hence, it is desirable to estimate 
this distribution by using the representative of the cluster 


without computing the correlations of documents. 
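The construction of the actual distribution described above can be sketched as follows. This is only an illustration under assumed data: the toy collection, the query, and the helper names `simple_matching` and `correlation_histogram` are hypothetical and do not come from the SMART collections.

```python
from collections import Counter

def simple_matching(doc, query):
    """Number of terms shared by a binary document vector and a binary query."""
    return sum(d * q for d, q in zip(doc, query))

def correlation_histogram(docs, query):
    """Relative frequency of documents at each correlation value 0..k."""
    k = sum(query)  # number of non-zero query terms
    counts = Counter(simple_matching(doc, query) for doc in docs)
    return [counts.get(i, 0) / len(docs) for i in range(k + 1)]

# Hypothetical toy collection: 4 documents over 5 terms (one list per document).
docs = [
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0],
]
query = [1, 0, 1, 0, 1]  # k = 3 non-zero terms

hist = correlation_histogram(docs, query)
print(hist)
```

Every document must be visited to build this histogram, which is the cost the cluster representative is meant to avoid.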


In this chapter, many different theoretical 
distributions are used to predict this distribution of 
correlations. All these theoretical distributions are 
generated by using only the representative of a cluster. 
Experiments are subsequently performed on a number of data 
collections (selected from the SMART retrieval system [3]) 
to demonstrate how closely these predicting distributions 


fit the actual distribution. 
A function called the C-function was proposed [2] to
predict the distribution of correlations of documents with 
respect to a given query. The details of the C-function and 


some related properties are discussed in this section. 
2.1.1 The C-function and its Probability Generating Function:


Consider a class of N documents and a query to be
represented by a document matrix M and a query vector Q as
described earlier. The occurrence or the non-occurrence of
the ith term in the documents can be represented by
$X_i = (d_{i1}, \ldots, d_{iN})$, where $d_{ij}$ is 1 if the ith
term occurs in the jth document and 0 otherwise. Associated
with $X_i$ we define a Bernoulli random variable $X_i'$ with
parameter $p_i = \sum_{j=1}^{N} d_{ij} / N$. This is the
probability that a document in the class has the ith term.
Thus a set of n random variables $X_1', \ldots, X_n'$ is
obtained in which $X_i'$ describes the occurrence of the ith
term in the class of documents. The representative of a
cluster is defined as

    $$R = (p_1, p_2, \ldots, p_n).$$


Let Q be the given query containing k non-zero terms.
Let $X_1, \ldots, X_k$ be the k independent random variables
corresponding to the non-zero terms of query Q, with $X_i$
taking on value 1 with probability $p_i$ and value 0 with
probability $1 - p_i$. Since the random variables correspond to
the terms of the query, the terms "random variable" and "term"
have the same meaning throughout this thesis. The independence
of terms implies that the occurrence of a term in a
document does not affect the occurrence of the other terms.
Let $X = \sum_{i=1}^{k} X_i$; then the probability that X has the
value i is the relative frequency of documents whose
correlation equals i and is designated by
= 0 otherwise
where g is a permutation of the integers {1,2,...,k} and the
summation is taken over all $\binom{k}{i}$ combinations [2].


The C-function can also be described by means of its
probability generating function. The probability generating
function of each $X_i$ is given by the polynomial
$p_i s + (1 - p_i)$. Since the generating function of the sum of
independent random variables equals the product of the
generating functions of these random variables [4], the
probability generating function of $X = \sum_{i=1}^{k} X_i$ is

    $$g(s) = \prod_{i=1}^{k} \bigl(p_i s + (1 - p_i)\bigr)$$

where the coefficient of $s^i$ gives the relative frequency of
documents at correlation = i. This generating function will
be used to derive the mean and the variance of the C-
function. To compute the C-function efficiently, we shall
simply multiply the probability generating functions of all
the terms.
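The multiplication of generating functions can be sketched as a repeated polynomial product. The function name `c_function` and the probabilities in the example are assumptions chosen for illustration, not values from the experiments.

```python
def c_function(p):
    """Coefficients of prod_i (p_i*s + (1 - p_i)); entry i is P(X = i)."""
    coeffs = [1.0]  # generating function of the empty product
    for p_i in p:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - p_i)  # term absent: correlation unchanged
            nxt[i + 1] += c * p_i      # term present: correlation grows by 1
        coeffs = nxt
    return coeffs

# Hypothetical representative restricted to a query with k = 2 terms.
p = [0.5, 0.25]
dist = c_function(p)
print(dist)  # relative frequencies at correlations 0, 1, 2
```

Each factor costs time proportional to the current degree, so the whole distribution is obtained in on the order of k squared operations, without enumerating the combinations in the defining sum.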


2.1.2 The Mean Value and the Variance of the C-function


It is of interest to derive the mean value and the
variance of the distribution of correlations of a cluster of
documents with respect to a given query. This mean value or
the average query-document correlation may be viewed as a
measurement of closeness between the "average" document in
the cluster and the query. Clusters with larger average
query-document correlations are more related to the query.
The variance indicates the average deviation of query-
document correlations from the mean correlation. A small
variance implies that most of the documents in the cluster
have query-document correlations close to the mean. Clusters
with large mean value and small variance of document-query
correlations are, of course, strongly related to the query
and contain many desired documents. These clusters should
contain most of the documents of interest to the user.


If g(s) denotes the probability generating function of an
integral valued random variable X, then the mean $E(X) = g'(1)$
and the variance $V(X) = g''(1) + g'(1) - (g'(1))^2$, where
$g'(1)$ and $g''(1)$ are the first and second derivatives of g
at s=1 [4].
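As a numerical check of these two identities, the derivatives at s=1 can be read off the coefficients of the generating function. The coefficients below are those of the product of two Bernoulli generating functions with assumed parameters 0.5 and 0.25; the helper name is hypothetical.

```python
def pgf_derivatives_at_one(coeffs):
    """Return (g'(1), g''(1)) for g(s) = sum_i coeffs[i] * s**i."""
    g1 = sum(i * c for i, c in enumerate(coeffs))
    g2 = sum(i * (i - 1) * c for i, c in enumerate(coeffs))
    return g1, g2

# Generating function (0.5 s + 0.5)(0.25 s + 0.75), expanded by hand.
coeffs = [0.375, 0.5, 0.125]

g1, g2 = pgf_derivatives_at_one(coeffs)
mean = g1
variance = g2 + g1 - g1 ** 2

# Direct formulas for a sum of independent Bernoulli variables.
p = [0.5, 0.25]
direct_mean = sum(p)
direct_variance = sum(q * (1 - q) for q in p)
print(mean, variance, direct_mean, direct_variance)
```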

The mean and the variance of the C-function can now
be easily derived.
The mean is

    $$E(X) = \sum_{i=1}^{k} E(X_i) = \sum_{i=1}^{k} p_i$$

and the variance is obtained as

    $$V(X) = \sum_{i=1}^{k} V(X_i) = \sum_{i=1}^{k} p_i (1 - p_i)$$

(by independence of the $X_i$'s).
The variance of the C-function may be used to estimate
the variance of the actual distribution of the query- 
document correlations. The following proposition shows that 
the mean of the C-function is equal to the mean of the 


actual distribution. 


Let $a_i$, i = 0, ..., k, be the number of documents having i
terms in common with the query and N be the number of
documents in the cluster. The mean of the query-document
correlations, which is $\sum_{i=0}^{k} i \, a_i / N$, is equal to
$\sum_{i=1}^{k} p_i$.

Proof : 


Consider a matrix where the columns represent the k
terms and the rows represent the N documents. The (i,j)th
entry in this matrix is 1 if the ith document has the jth
term and 0 otherwise. Then $\sum_{i=0}^{k} i \, a_i$ is the number
of 1's in the matrix, counting rowwise. If the 1's are counted
columnwise, then we obtain $\sum_{i=1}^{k} N p_i$. Hence, the
result follows.
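The counting argument can be checked on a small example. The collection and query below are hypothetical, and `representative` and `mean_correlation` are illustrative helper names.

```python
def representative(docs, query):
    """p_i for each non-zero query term: fraction of documents containing it."""
    n_docs = len(docs)
    return [sum(doc[i] for doc in docs) / n_docs
            for i, q in enumerate(query) if q]

def mean_correlation(docs, query):
    """Average simple-matching correlation over all documents (rowwise count)."""
    total = sum(sum(d * q for d, q in zip(doc, query)) for doc in docs)
    return total / len(docs)

# Hypothetical collection: 4 documents over 4 terms.
docs = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
query = [1, 0, 1, 1]

p = representative(docs, query)           # columnwise count divided by N
actual_mean = mean_correlation(docs, query)
print(p, sum(p), actual_mean)
```

The equality holds whether or not the terms are independent, because it only rearranges the order in which the 1's are counted.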


2.1.3 Approximations of the C-function by Other
Probability Functions.


In addition to the C-function, the Poisson distribution 
[9,10,14,19] and the normal distribution [11] have been 
proposed by others for describing the distribution of 


correlations. The relationships of these distributions to 
the C-function are discussed next.


The following proposition shows that the C-function
approaches the Poisson distribution with the (same)
mean, $\sum_{i=1}^{k} p_i$, whenever all $p_i$ are sufficiently
small.
where the $d_j$'s, $1 \le j \le 2$, are respectively the density functions
The proposition is proved by applying (2.1) to
exercises (12) and (14) on p. 286 of vol. 2 in [4].


The next two propositions demonstrate that the C-
function can be approximated by the normal distribution
whose mean is $\sum_{i=1}^{k} p_i$ and whose variance is
$\sum_{i=1}^{k} p_i (1 - p_i)$ when
Proof :

It is easy to see that the numerator of A does not
increase and the denominator of A strictly increases as the
$p_i$ increase in [0,1/2). Since PCD ae De) = A (iopiy sere tea
the desired result follows.


As mentioned in the last section, different theoretical 
distributions, namely, the C-function, the Poisson 
distribution and the normal distribution are used to predict 
the actual distribution of correlations of the documents 
with respect to a given query. Experiments are performed to 
find out the accuracy of such predictions. The Chi-square 
test of the goodness of fit is used as a criterion to judge 
the accuracy of a given theoretical distribution to the 
actual distribution. The null hypothesis is that the actual
distribution follows the theoretical distribution.
Initially, there are (k+1) classes or cells corresponding to
each of the correlation values from 0 to k. By the theory of
the Chi-square test, the neighboring classes are combined until
the smallest expected frequency in each class is at least 5.
Experimental results are based on a 5% level of significance 
using m-1 degrees of freedom, where m is the number of 


classes that result after neighboring classes are combined. 
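The test procedure can be sketched as below. The merging rule used here (sweep from the smallest correlation upward and fold any leftover tail into the last class) is only one assumed way of combining neighboring classes, and the observed and expected counts are invented for illustration. The 5% critical values are the standard tabulated ones.

```python
# Standard 5% critical values of the Chi-square distribution, by degrees of freedom.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def merge_cells(observed, expected, min_expected=5.0):
    """Combine neighboring cells until every expected frequency is >= min_expected."""
    obs_groups, exp_groups = [], []
    obs_acc = exp_acc = 0.0
    for o, e in zip(observed, expected):
        obs_acc += o
        exp_acc += e
        if exp_acc >= min_expected:
            obs_groups.append(obs_acc)
            exp_groups.append(exp_acc)
            obs_acc = exp_acc = 0.0
    if obs_acc > 0 or exp_acc > 0:  # leftover tail below the minimum
        if exp_groups:
            obs_groups[-1] += obs_acc
            exp_groups[-1] += exp_acc
        else:
            obs_groups.append(obs_acc)
            exp_groups.append(exp_acc)
    return obs_groups, exp_groups

def chi_square_fit(observed, expected):
    """Return (Chi-square quantity, degrees of freedom m - 1) after merging."""
    obs, exp = merge_cells(observed, expected)
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return stat, len(obs) - 1

# Hypothetical counts for N = 100 documents and a query with k = 4 terms.
observed = [55, 25, 12, 5, 3]
expected = [50.0, 30.0, 14.0, 4.0, 2.0]

stat, df = chi_square_fit(observed, expected)
good_fit = stat <= CRITICAL_5_PERCENT[df]
print(stat, df, good_fit)
```

In this example the last two cells are combined, leaving m = 4 classes and 3 degrees of freedom.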
A "data collection" in the SMART system [3] consists of
a class or a collection of documents and a set of queries.
With respect to each query, a Chi-square test is performed
to fit the theoretical distribution to the actual
distribution. Although the accuracy may be judged by the
Chi-square quantity obtained, our discussion is based on the
percentage of good fits among the tests (the number of tests =
the number of queries in the collection). The following
collections are chosen from the SMART system [3]:
ADIABTH, ADINUL, CRN2NUL, CRN2TH, CRN4NUL, MEDNUL, and
CRN14TH.


2.2.1 Evaluation of the Accuracy in Predicting the Actual
Distribution by the C-function.


The first theoretical distribution to be tested is the 
C-function. The results are given in the first column of 


Table 1. 


It is found (Table 1) that the actual distribution and
the C-function distribution are close to each other for most
queries found in small collections, as in collection ADIABTH
(97.1% good fits) and ADINUL (88.6% good fits) [3]. They
differ substantially for most queries in the larger
collections. For example, the actual distribution follows
the C-function for only 4.9% of the queries in collection
CRN14TH [3].


Two reasons are suspected for this poor result:

Terms are not independent.

It should be pointed out that the C-function is
derived based on the assumption that all terms are
independent. If this condition holds, the C-function
would be a good approximation to the actual
distribution. However, this ideal condition may not be
satisfied by the actual data. The contingency table
test based on the Chi-square test is used to find out
the degree of independence of the terms of the queries.
Experiments are performed for some of the queries in
the CRN14TH collection. Sample results tested on the
CRN14TH collection are given in Table 2. 249 (34%) out
of a total of 734 term pairs (a sample) are found to be
dependent in the CRN14TH collection. In contrast, 69
(7%) out of 988 term pairs (a sample) are dependent in
the ADINUL collection.

Due to the dependence between terms found in large
collections, the C-function is rather different from
the actual distribution. The reason that terms in large
collections are likely to be dependent is discussed in
section 2.2.2.2.
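A contingency table test for one pair of terms can be sketched as follows. The document counts are invented, the function name is hypothetical, and the use of a 5% level with 1 degree of freedom for the 2x2 table is an assumption made for this illustration.

```python
CRITICAL_5_PERCENT_1_DF = 3.841  # standard tabulated value

def term_pair_chi_square(docs, i, j):
    """Chi-square quantity of the 2x2 contingency table for terms i and j."""
    n = len(docs)
    n11 = sum(1 for d in docs if d[i] and d[j])      # both terms
    n10 = sum(1 for d in docs if d[i] and not d[j])  # only term i
    n01 = sum(1 for d in docs if not d[i] and d[j])  # only term j
    n00 = n - n11 - n10 - n01                        # neither term
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # a term that always or never occurs carries no evidence
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

# Hypothetical collections of 100 documents over two terms.
dependent_docs = [[1, 1]] * 40 + [[1, 0]] * 10 + [[0, 1]] * 10 + [[0, 0]] * 40
independent_docs = [[1, 1]] * 25 + [[1, 0]] * 25 + [[0, 1]] * 25 + [[0, 0]] * 25

dep_stat = term_pair_chi_square(dependent_docs, 0, 1)
ind_stat = term_pair_chi_square(independent_docs, 0, 1)
print(dep_stat > CRITICAL_5_PERCENT_1_DF, ind_stat > CRITICAL_5_PERCENT_1_DF)
```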


The limitation of the Chi-square test. 

Since the Chi-square distribution has been
established as the limiting distribution as the size of
a sample increases [14], it is clear that a large
sample is recommended for the test. If the actual
distribution follows exactly a certain theoretical
distribution, then the more documents we sample, the
better will be the fit of the actual distribution to
the theoretical distribution. On the other hand, if the
actual distribution does not follow a certain
theoretical distribution exactly, but is only close to it,
then the actual distribution from a small sample will
tend to fit the theoretical one easily, while the actual
one from a large sample fits the theoretical distribution
only with difficulty.


In his original paper, Karl Pearson [18] was aware
of this. He writes, "Nor again does it appear to follow
that if the number be largely increased the same curve
will still be a good fit. Roughly, the Chi-square's of
two samples appear to vary for the same grouping as
their total contents. Hence, if a curve be a good fit
for a large sample, it will be good for a small one,
but the converse is not true, and a larger sample may
show that our theoretical frequency gives only an
approximate law for samples of a certain size."


With the above concept in mind, it is unfair to
compare results from collections of different sizes.
Therefore, different sizes of samples were then 
randomly selected from the CRN14TH collection. Similar 
experiments were performed on these samples. Results 
given in Table 3 demonstrate that the C-function 
approximates the actual distribution much better for 
small samples. The poor approximation in
large collections is partly due to this limitation of
the Chi-square test. This difficulty with the Chi-
square test is encountered throughout all experiments
in this thesis.


It should also be pointed out that the C-function 
generally under-estimates the "beginning part" and the 
"ending part" of the actual distribution. The beginning part 
of the distribution indicates the frequencies at small 
correlation values while the ending part refers to those 


having large correlations. 


2.2.2 The C-function with Modifications


In view of the fact that some terms are dependent in 
large collections, the C-function is modified to take into 


consideration this dependence of terms. 


2.2.2.1 The Modified C-function with Dependent Terms Being
Combined.


The C-function is modified as follows. The dependence
of a pair of terms is measured by the Chi-square quantity
obtained from the contingency table test. This quantity is
treated as a measurement of the degree of dependence of the
two terms. A large quantity for a term pair implies a
high dependence between the two terms. As an approximation,
terms are assumed to depend on at most one other term. In
the experiments to be described, a term is combined with the
most dependent term which has not yet been combined.
For each term pair, a generating function is formed. Since 
terms in different term pairs are assumed to be independent, 
the generating functions of different term pairs and the 
generating functions of each independent term (if any) are 
multiplied together to obtain the generating function for 


the set of all terms. 


Experimentally, this modified C-function is produced as 


follows: 


Consider a set of terms $X_1, \ldots, X_k$ in which some are
dependent while the remainder are independent. The two most
dependent terms, $X_i$ and $X_j$, are first extracted from the
set of dependent terms. Let $a_{ij}$, $b_{ij}$, $c_{ij}$ be the
actual frequencies of documents having both $X_i$ and $X_j$,
either $X_i$ or $X_j$ (but not both), and neither $X_i$ nor
$X_j$, respectively. The polynomial
$a_{ij} s^2 + b_{ij} s + c_{ij}$ is then used as the probability
generating function of the combination of the terms $X_i$ and
$X_j$. After $X_i$ and $X_j$ are removed the above process is


repeated on the remaining terms until there are no dependent 
term-pairs left. For each dependent term-pair, a generating 
function is obtained. We next compute the generating 
functions for each of the remaining terms. By multiplying 
all the generating functions, the desired expected 
distribution is obtained. This is an approximation to the 
actual distribution with pairwise dependence of terms taken 
into consideration. The experimental results of 


approximating the actual distribution by the above expected 
distribution are given in the second column of Table 1. The
results (for example, an improvement from 4.9% to 24.4% good
fits for the CRN14TH collection) demonstrate a significant
improvement over those obtained when the C-function was 


used. 
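The construction of this modified distribution can be sketched as a product of one quadratic per dependent pair and one linear factor per remaining term. The pair frequencies and the single-term probability below are assumed values, taken here as relative frequencies so that each factor sums to 1; the function names are hypothetical.

```python
def poly_mul(f, g):
    """Product of two polynomials given as coefficient lists (lowest power first)."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def modified_c_function(pair_freqs, single_probs):
    """Expected distribution with pairwise dependence taken into account.

    pair_freqs   -- list of (a, b, c): relative frequencies of documents having
                    both terms, exactly one term, and neither term of a pair
    single_probs -- occurrence probabilities of the remaining independent terms
    """
    dist = [1.0]
    for a, b, c in pair_freqs:
        dist = poly_mul(dist, [c, b, a])        # a*s^2 + b*s + c
    for p in single_probs:
        dist = poly_mul(dist, [1.0 - p, p])     # p*s + (1 - p)
    return dist

# Hypothetical query with one dependent pair and one independent term.
pair_freqs = [(0.25, 0.25, 0.5)]
single_probs = [0.5]

dist = modified_c_function(pair_freqs, single_probs)
print(dist)  # relative frequencies at correlations 0..3
```

If the two paired terms had marginal probabilities 0.5 and 0.25, independence would give a co-occurrence frequency of 0.125 rather than the 0.25 assumed here, which is the kind of discrepancy the quadratic factor is meant to capture.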


The difference between this modified distribution and
the C-function is that, instead of using the product of the
generating functions of $X_i$ and $X_j$, the function
$a_{ij} s^2 + b_{ij} s + c_{ij}$ is used. It is clear that a more
complicated cluster representative is required for this
modification, one which provides the probabilities that two
terms co-occur and that exactly one term occurs. This
modification is performed for each pair of dependent terms
identified. The error in the estimation of the distribution
for $X_i + X_j$ due to the dependence between terms such as
$X_i$ and $X_j$ is eliminated.
Thus, the modified distribution is a better approximation of 
the actual distribution than the C-function. Note, however, 
that the dependency of the terms in each term pair is 


eliminated, but the dependencies between terms in different 


term pairs still exist. 


Since this modification takes into account the 
dependencies between two terms, the better prediction 
obtained suggests that the terms are not independent in some 
data collections being tested. To get a better prediction a 
more complicated model is required which takes into account 
dependencies involving more than two terms. Another 


alternative is to remove the dependencies between terms 
which will be discussed in Chapter 3.


2.2.2.2 The Modified C-function with a Classified Collection


If the terms in a large collection are likely to be
dependent, whereas terms can be expected to be independent in
a small collection of similar documents, then splitting a
large collection into small clusters of similar documents
should make the C-function a good prediction of the
actual distribution on each small cluster.


In the following experiments, the theoretical
distributions are produced as follows.
First, the whole document collection is divided into two
sets, the relevant and the irrelevant documents, with
respect to each query, and the corresponding C-functions are
obtained. Then these two distributions are combined as
follows:

the expected number of documents having i terms in 

common with a query = the expected number of relevant 

documents having i terms in common with the query + the 

expected number of irrelevant documents having i terms 


in common with the query, for i=0,...,k.
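The combination rule above amounts to a weighted sum of two C-functions. The sketch below assumes hypothetical set sizes and term probabilities; `c_function` is the generating-function product and `combined_expected_counts` is an illustrative name.

```python
def c_function(p):
    """Coefficients of prod_i (p_i*s + (1 - p_i)); entry i is P(X = i)."""
    coeffs = [1.0]
    for p_i in p:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - p_i)
            nxt[i + 1] += c * p_i
        coeffs = nxt
    return coeffs

def combined_expected_counts(n_rel, p_rel, n_irr, p_irr):
    """Expected number of documents at each correlation 0..k for the whole
    collection: relevant-set expectation plus irrelevant-set expectation."""
    rel = c_function(p_rel)
    irr = c_function(p_irr)
    return [n_rel * r + n_irr * q for r, q in zip(rel, irr)]

# Hypothetical query with k = 2 terms; 4 relevant and 16 irrelevant documents.
p_rel = [0.5, 0.5]      # term probabilities within the relevant set
p_irr = [0.25, 0.25]    # term probabilities within the irrelevant set

expected = combined_expected_counts(4, p_rel, 16, p_irr)
print(expected)  # expected counts at correlations 0, 1, 2
```

When the relevant set is a very small fraction of the collection, the first term contributes little, which matches the small effect reported for the largest collection.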


Experimental results of fitting the above distribution 
to the actual distribution are given in the third column of 
Table 1. A substantial improvement is achieved for the
collections of medium size, for example, CRN4NUL (improved
from 17.4% to 45.2%) and MEDNUL (from 40% to 63.3%).
However, since the number of relevant documents is too small
(less than 1%) compared to the total number of documents in
CRN14TH (1400 documents), the modification has little effect
on the expected distribution. The improvement is, hence, not
significant for this large collection.


It should be pointed out that the above results made 
use of the assumption that the terms in the relevant set are 
independent and those in the irrelevant set are also 
independent. The better prediction obtained implies that the 
terms in a collection of similar documents are more likely 
to be independent. We may assume that a class of documents
with similar characteristics has independent terms. If a
collection is large, documents in the collection are
unlikely to be similar. Hence, the terms in large
collections are likely to be more or less dependent. If a
large collection can be classified properly into some small
clusters of homogeneous documents, this modified C-function 
is likely to be a good approximation to the actual 


distribution. 


Since the information available on the data collection 
is not sufficient to obtain clusters of homogeneous 
documents, the whole collection is split only into two sets 
(the relevant and the irrelevant documents.) However, it is 
possible to apply clustering algorithms or employ factor 


analysis [12,13] to partition a given set of documents. 
2.2.3 Evaluation of the Accuracy in Predicting the Actual
Distribution by the Poisson Distribution and the
Normal Distribution.


Damerau [9] has assumed that the occurrence frequency
of a term in a large collection of documents follows the
Poisson distribution. However, Swets [11] has described the 
distribution of correlations by a normal distribution in his 
study of the measure of effectiveness of an information 


retrieval system. 




In this section, the accuracy of these two probability 
functions in predicting the actual distribution of 
correlations is considered. 


2.2.3.1 Prediction by the Poisson Distribution.


In the following experiments, the Poisson distribution
with mean $\sum_{i=1}^{k} p_i$ is used to approximate the actual
distribution. It is easy to show that the variance of the
Poisson distribution, $\sum_{i=1}^{k} p_i$, is larger than that
of the C-function. It can also be shown that the relative
frequency for the Poisson distribution is greater than that of
the C-function at correlation = 0. The frequency is also
greater at k if k, the number of the terms of the given query,
is not too small (greater than 3) and the mean of the $p_i$'s is
not too large (no larger than 1/2).
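These comparisons can be checked numerically for a particular set of probabilities. The values of p below are hypothetical (k = 4, mean of the p's equal to 0.125); the check illustrates the stated inequalities for one case and is not a proof.

```python
import math

def c_function(p):
    """Coefficients of prod_i (p_i*s + (1 - p_i)); entry i is P(X = i)."""
    coeffs = [1.0]
    for p_i in p:
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - p_i)
            nxt[i + 1] += c * p_i
        coeffs = nxt
    return coeffs

def poisson_pmf(mean, i):
    """Poisson probability of the value i for the given mean."""
    return math.exp(-mean) * mean ** i / math.factorial(i)

p = [0.1, 0.2, 0.05, 0.15]  # hypothetical term probabilities
k = len(p)
mean = sum(p)               # shared mean; also the Poisson variance

c_dist = c_function(p)
poisson = [poisson_pmf(mean, i) for i in range(k + 1)]
c_variance = sum(q * (1 - q) for q in p)

print(poisson[0] > c_dist[0])  # beginning part
print(poisson[k] > c_dist[k])  # ending part
print(mean > c_variance)       # Poisson variance vs. C-function variance
```

The Poisson distribution also places some probability above k, so the listed values for 0 to k sum to slightly less than 1.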


Experimental results in column 4 of Table 1 show that 


the Poisson distribution is a better approximation than the 
C-function for large collections (for example, 34.2% good
fits compared to 4.9% by the C-function in CRN14TH). As
pointed out in section 2.2.1, the experiments show that the 
C-function usually under-estimates the beginning part and 
the ending part of the distribution of correlations. Since 
the Poisson distribution has its beginning and ending part 
greater than those of the C-function, a better approximation 


to the actual distribution is achieved. 


The Poisson distribution is usually used to 
characterize a distribution of a random variable which can 
take on many values (for instance, the number of 
occurrences) but with small probability [17]. Since a large 
collection is likely to be more heterogeneous than a small 
collection and a particular term may not appear in too many 
documents, we assume that the probability that a term occurs 
in a document is small for a large collection. This 
satisfies the property of a Poisson distribution. Hence, 
Damerau's assumption that occurrence frequencies of a term
follow a Poisson distribution should be considered
meaningful only for a large collection of documents.
Furthermore, the distribution of correlations follows the
Poisson distribution by the addition property of the Poisson
distributions of each term. In summary, a Poisson
distribution is suitable for a large collection. Harter [10]
in his recent study found that the single Poisson
distribution is not adequate. The two-Poisson model and the


linked two-Poisson model were recently proposed [19,10,14]. 
2.2.3.2 Prediction by the Normal Distribution.


In the following experiments a normal distribution with
mean $\sum_{i=1}^{k} p_i$ and variance
$\sum_{i=1}^{k} p_i (1 - p_i)$ is used to approximate
the actual distribution of correlations. Since the normal
distribution is continuous, the area in the interval (i-1/2,
i+1/2) is assumed to be the relative frequency of documents
having correlation equal to i.
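The interval rule can be sketched with the error function from the standard library. The probabilities below are hypothetical and chosen so that the mean is 2 and the variance is 1; the function names are illustrative.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_prediction(p):
    """Relative frequency at each correlation i = 0..k, taken as the normal
    area over the interval (i - 1/2, i + 1/2)."""
    mu = sum(p)
    sigma = math.sqrt(sum(q * (1.0 - q) for q in p))
    k = len(p)
    return [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma)
            for i in range(k + 1)]

# Hypothetical query with k = 4 terms, each with probability 1/2.
p = [0.5, 0.5, 0.5, 0.5]
freqs = normal_prediction(p)
print(freqs)
```

The predicted frequencies are symmetric about the mean, so this form cannot reproduce a distribution in which most documents sit at small correlations.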


Experimental results given in column 5 of Table 1 show
that the normal distribution does not fit the actual
distribution (0% good fits in the CRN14TH collection) except
for the ADINUL and ADIABTH collections. Experimental results
show that the normal distribution is a worse approximation
than the C-function. Since the $p_i$'s are fairly small in most
of the collections, except in the ADINUL collection, most of
the documents have small correlations with the query. Hence,
the distribution of correlations is rather skewed towards
the ending part. A normal distribution, which is symmetric,
is thus not likely to be a reasonable fit to an actual
distribution that is skewed.


Consequently, with small $p_i$'s, Swets' [11] assumption
of the normal distribution of correlations is questionable.
In general, the $p_i$'s tend to be small for a large
collection, and as mentioned, Swets' [11] model is not
suitable for such a collection. However, if the means of the
$p_i$'s are neither too large nor too small (the means of the
$p_i$'s are usually greater than 0.25 in the ADINUL collection)
the normal distribution may be appropriate.
CHAPTER 3


THE DATA TRANSFORMATION AND THE DISTRIBUTION OF 


QUERY-DOCUMENT CORRELATIONS ON WEIGHTED DATA 


The experiments in Chapter 2 demonstrate that the dependencies between terms may cause the C-function to be rather different from the actual distribution of correlations. However, if the dependencies between terms can be removed, the C-function may fit the actual distribution well. We shall employ principal component analysis [5] to transform a set of dependent terms to a set of uncorrelated terms. Although uncorrelated terms are not necessarily independent, the two concepts may be assumed to be equivalent for many practical purposes. Similar experiments will subsequently be performed on the data set obtained after performing the transformation.


The transformation induced on the data by the method referred to above is described in this section. It transforms a set of dependent terms $\{X_1, \ldots, X_n\}$ to a new set of uncorrelated concepts $\{Y_1, \ldots, Y_n\}$. Each transformed term (new concept) $Y_i$ corresponds to a linear combination of
3.1.1 The Concept of the Transformation

The transformation is based on some fundamental theorems in linear algebra. Some preliminary ideas are introduced first.

(1) Every symmetric matrix S of rank n has n independent eigenvectors associated with its n eigenvalues [6].
(2) If these n independent eigenvectors are chosen to be of unit length, then the matrix $A = (a_{ij})$ formed with these n independent eigenvectors as the rows of A is orthogonal, that is, $A^t = A^{-1}$, and A can be used to transform this symmetric matrix S into a diagonal matrix as shown below

$$A S A^{-1} = D$$

where D is the diagonal matrix having the eigenvalues of S along the diagonal [7].


Consider a document matrix $M = (d_{ij})_{n \times N}$ where a row represents a term and a column represents a document. Define n random variables $X_1, \ldots, X_n$ associated with the term vectors of M. Without loss of generality, we may assume that all $X_i$'s are standardized, with zero mean and unit variance.
of S < n.) By the above theorems in linear algebra, there exists an orthogonal matrix $A = (a_{ij})$ such that $A S A^{-1} = D$, where the ith row of A is the ith eigenvector of S and D is a diagonal matrix whose (i,i)th entry is the ith eigenvalue of S.


Define a set of n random variables $\{Y_1, \ldots, Y_n\}$ as n linear combinations of $\{X_1, \ldots, X_n\}$ as follows:

$$Y_i = \sum_{j=1}^{n} a_{ij} X_j, \qquad i = 1, \ldots, n. \qquad (3)$$

We now prove that these n new variables are uncorrelated and that the variance of $Y_i$ equals the ith eigenvalue of S.
where $Y_i$, $Y_j$ are defined in (3), $r(Y_i, Y_j)$ is the correlation coefficient of $Y_i$ and $Y_j$, $V(Y_i)$ is the variance of $Y_i$, and $\lambda_i$ is the ith eigenvalue of S.
3.1.2 The Procedure of Transformation.


The transformation based on the above concept can be 


carried out using the following steps: 


Step 1: Construct the correlation coefficient matrix $S_{nn} = (s_{ij})$ for $\{X_1, \ldots, X_n\}$, where $s_{ij}$ is the
Step 3: Form the new document matrix.
(i) Standardize the document matrix M with respect to each term:

$$d''_{ij} = \frac{d_{ij} - \mu_i}{\sigma_i}$$

where $\mu_i$ and $\sigma_i$ are the mean and the standard deviation of term $X_i$.

(ii) Premultiply the standardized matrix by the transforming matrix A:

$$M'_{nN} = A_{nn} M''_{nN}$$

where $M''_{nN}$ is the document matrix after standardization.


It should be noted that the eigenvalues are arranged in descending order in step (2.i). The reason for this arrangement will be discussed in the next section. The re-scaling in step (2.ii) is performed to ensure that the eigenvectors are unique.
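The whole procedure can be sketched as follows for a small document matrix. This is an illustration under stated assumptions rather than the program used in the experiments: the eigenvalues are extracted with a cyclic Jacobi method (a technique chosen here only because it is short), the re-scaling of step (2.ii) is reduced to using unit-length eigenvectors, every term is assumed to have non-zero variance, and all function names (`standardize`, `matmul`, `jacobi_eigen`, `pca_transform`) are hypothetical.

```python
import math

def standardize(M):
    """Step 3(i): give every row (term) of M zero mean and unit variance."""
    Z = []
    for row in M:
        mu = sum(row) / len(row)
        sigma = math.sqrt(sum((d - mu) ** 2 for d in row) / len(row))
        Z.append([(d - mu) / sigma for d in row])  # assumes sigma > 0
    return Z

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def jacobi_eigen(S, sweeps=100, tol=1e-18):
    """Eigenvalues and unit eigenvectors (columns of v) of a symmetric matrix."""
    n = len(S)
    a = [list(row) for row in S]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation that zeroes the off-diagonal entry a[p][q].
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(tau * tau + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q of a and of v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v

def pca_transform(M):
    """Return (eigenvalues, A, M') with the eigenvalues in descending order."""
    Z = standardize(M)
    N = len(M[0])
    # Step 1: correlation coefficient matrix S = Z Z^t / N.
    S = [[x / N for x in row] for row in matmul(Z, list(zip(*Z)))]
    # Step 2 (assumed form): eigenvalues in descending order,
    # unit-length eigenvectors as the rows of A.
    values, v = jacobi_eigen(S)
    order = sorted(range(len(values)), key=lambda i: -values[i])
    A = [[v[k][i] for k in range(len(v))] for i in order]
    # Step 3(ii): premultiply the standardized matrix by A.
    return [values[i] for i in order], A, matmul(A, Z)

# Three binary terms (rows) in six documents (columns).
M = [[1, 1, 0, 0, 1, 0],
     [1, 0, 0, 1, 1, 0],
     [0, 1, 1, 0, 0, 1]]
eigenvalues, A, M_new = pca_transform(M)
```

The rows of `M_new` are the transformed terms; their pair-wise covariances are zero up to rounding and their variances equal the corresponding eigenvalues.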


Queries can be transformed in the same way as documents. Consider a binary query $Q = (q_1, \ldots, q_n)$ as described in Chapter 2. Q is transformed to
3.1.3 The Characterization of the Transformation.


The new terms obtained, besides being uncorrelated, 


also feature the following properties. 


3.1.3.1 The Number of Terms after Transformation.

It is clear that the above transformation will produce the same number of new terms as there are old ones, provided $S_{nn}$ is a nonsingular matrix. If $S_{nn}$ is a singular matrix, a matrix $S_{mm}$ of rank m can be obtained by performing elementary row and column operations on $S_{nn}$. Hence, m eigenvalues associated with m independent eigenvectors are extracted from $S_{mm}$ and the number of new transformed terms is reduced to m.


Since each component of an eigenvector is a real number after re-scaling, every entry of the transforming matrix A will be a real number. Furthermore, the document matrix M also becomes a real matrix M'' after the standardization of each term. Hence, the above transformation will transform a binary document matrix to a real matrix M'. We shall treat the real values in the matrix as the weights (importance) of the terms in the documents. Since the terms after the transformation take on real values, the storage is significantly increased in relation to the original binary representation of documents and queries. In addition, the amount of computation required to produce both the actual and expected distributions (to be described in section 3.2) is substantially higher, especially when the total number of terms is large. Since the number of terms is usually very large (for example, there are 1345 distinct terms in the documents of the CRN4NUL collection), it is often desirable to discard some of the newly formed terms.


Intuitively, the terms with small variances are less important than those with large variances. For example, if every document has a particular term with the same importance (weight), this term may be ignored, since it does not differentiate one document from another. It may be argued that terms with larger variances are more capable of differentiating documents from one another. Hence, if it is required to discard some transformed terms, the terms with smaller variances should be discarded first. By Proposition 3.1, the variance of the ith transformed term is equal to the ith eigenvalue of S. Since in step (2.i) the eigenvalues of S were arranged in descending order, the transformed terms are in descending order according to their variances. Terms with small variances can, therefore, be discarded easily.
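Because the transformed terms are already ordered by variance, discarding the less important ones amounts to truncating a list. The sketch below assumes a hypothetical helper `keep_leading_terms`; its default threshold of one is the value used later in section 3.3, and any other cut-off could be supplied.

```python
def keep_leading_terms(eigenvalues, transformed, min_variance=1.0):
    """Keep only the transformed terms whose variance is at least min_variance.

    eigenvalues must be in descending order; transformed[i] is the row of
    the transformed document matrix that belongs to eigenvalues[i].
    """
    m = sum(1 for value in eigenvalues if value >= min_variance)
    return eigenvalues[:m], transformed[:m]

# Four transformed terms, each with its weights in two documents.
kept_values, kept_rows = keep_leading_terms(
    [2.5, 1.0, 0.4, 0.1],
    [[0.9, -0.9], [0.3, 0.2], [0.1, 0.0], [0.0, 0.05]])
```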
3.1.3.2 Other Remarks on the Transformation. 


(a). The pair-wise correlations between the newly formed terms have been shown to be zero. Unfortunately, uncorrelated terms are not necessarily independent [5]. However, if the original term set $\{X_1, \ldots, X_n\}$ has a multivariate normal distribution [5], the new terms obtained after transformation, $\{Y_1, \ldots, Y_n\}$, are mutually independent and are normally distributed [8,5].


(b). The covariance matrix can be used, instead of the correlation coefficient matrix, while extracting eigenvalues and constructing the transforming matrix. If the covariance matrix is used, then the standardization of the document matrix in step (3) can be omitted. In general, the transformed variables obtained from the correlation coefficient matrix are different from those obtained from the covariance matrix.

(c). The transformation described above can be applied to both binary and real weighted documents and queries.


3.2 The Distribution of Correlations on Weighted Data


We have shown that the weights of terms in the 
documents after transformation are no longer binary. They 
are real numbers instead. In the following sections, the 
actual and the theoretical distributions of the correlations 


will be derived for the set of real data. 
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3.2.1 The Actual Distribution of Correlations. 


Consider a document matrix $M' = (d'_{ij})$ where the rows represent terms and the columns represent documents, and the entry $d'_{ij}$ represents the weight of the ith term $Y_i$ in the jth document $D_j$. Similarly, let a query Q' be represented by a vector $(q'_1, \ldots, q'_n)$ where $q'_i$ is the weight of the ith term in the query. The correlation of document $D_j$ and query Q' is designated by

$$c(D_j, Q') = \sum_{i=1}^{n} q'_i \, d'_{ij}.$$

Since the weights of the terms are real numbers instead of binary integers, it is unlikely that any two documents will have exactly the same correlation with respect to a given query. Once all query-document correlations are calculated, we have a set of real correlation values scattered over some range. Hence, we first group these correlation values into a number of intervals and then determine the number of documents having correlations falling in each of the intervals. The length of an interval is called the interval width. After all correlation values are grouped into intervals and the frequency in each interval is computed, the individual correlation values may be ignored. The midpoint of an interval will be used to represent the correlations of those documents whose query-document correlations fall in that interval. Let $a_i$, for i = 1, ..., m (the number of intervals), be the midpoints of the various intervals. A method to decide the interval to which the correlation x of a document belongs is by
where w is a chosen interval width. The frequency of documents having correlation $a_i$ is obtained by counting the documents whose correlations fall in the ith interval. Thus, the desired distribution of correlations is obtained.
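The construction of the actual distribution can be sketched as follows. The grouping rule is an assumption made for this illustration: each correlation is assigned to the interval whose midpoint is the nearest multiple of the interval width w, which need not be the rule used in the experiments. The names `correlation` and `actual_distribution` are hypothetical.

```python
import math
from collections import Counter

def correlation(query, doc):
    """Sum over all terms of (query weight) * (document weight)."""
    return sum(q * d for q, d in zip(query, doc))

def actual_distribution(query, docs, w):
    """Map each interval midpoint to the number of documents in that interval.

    docs is a list of document vectors (the columns of the document matrix).
    Assumed rule: intervals of width w centred on the multiples of w.
    """
    counts = Counter()
    for doc in docs:
        x = correlation(query, doc)
        index = math.floor(x / w + 0.5)  # nearest multiple of w
        counts[index * w] += 1
    return dict(counts)

query = [1.0, -0.5]
docs = [[0.2, 0.4], [1.1, 0.0], [0.9, -0.3], [-0.4, 0.4]]
distribution = actual_distribution(query, docs, w=0.5)
```

A smaller interval width gives a finer distribution but leaves fewer documents in each interval.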


3.2.2 The Theoretical Distribution - A Generalized C-function for Weighted Data.


A theoretical distribution similar to the C-function 
for binary data will be derived in this section for the 
Situation where the weights of terms in documents and 


queries are real. 


Consider the same document matrix $M' = (d'_{ij})$ and query $Q' = (q'_1, \ldots, q'_n)$ as described in the last section, where all weights are real numbers and every term is assumed to be independent of the other terms. We shall first construct the representative as follows:


The values that the weights of a term can assume in documents are grouped into a number of intervals. For each interval, the midpoint of the interval and the relative frequency of the weights of the term falling in the interval are obtained. For ease of computation, the weights of all the documents in a given interval are assumed to coincide with the midpoint. Thus, for each term we have a distribution of weights in intervals. The representative of the collection (cluster) is then defined to be the set of these distributions of weights of terms in documents.


The process of generating the theoretical distribution of correlations with respect to the query is presented in two stages: the first stage consists of finding the distribution of correlations for each term in the query as if each term were a query with one term, and the second stage involves combining these distributions to get the desired theoretical distribution.


To find the distribution of correlations for term $Y_i$, we simply multiply the midpoints of the various weight intervals of the weight distribution of term $Y_i$ by $q'_i$, the weight of term $Y_i$ in the query. The distribution of correlations for each term can be represented by a probability generating function, which is a polynomial.


Without loss of generality, we assume that the distributions of correlations for the various terms have the same number of intervals of correlations and the same midpoints of intervals. Since all the terms are assumed to be independent, the distribution of correlations with respect to the query can then be obtained as the convolution of the distributions corresponding to the terms in the query or, equivalently, as the product of all the generating functions.
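The two stages can be sketched as follows. In this illustration the representative is assumed to be stored as one dictionary per term, mapping the midpoint of a weight interval to its relative frequency, and every correlation value is snapped to the nearest multiple of a common interval width w so that all distributions share the same midpoints. The storage layout, the snapping rule, and all function names are assumptions of the sketch.

```python
import math
from collections import defaultdict

def snap(x, w):
    """Midpoint of the interval of width w containing x (assumed grid)."""
    return math.floor(x / w + 0.5) * w

def term_distribution(weight_dist, q, w):
    """Stage 1: correlation distribution for a one-term query of weight q."""
    out = defaultdict(float)
    for midpoint, freq in weight_dist.items():
        out[snap(q * midpoint, w)] += freq
    return dict(out)

def convolve(f, g, w):
    """Distribution of the sum of two independent correlations."""
    out = defaultdict(float)
    for x, px in f.items():
        for y, py in g.items():
            out[snap(x + y, w)] += px * py
    return dict(out)

def generalized_c_function(representative, query, w):
    """Stage 2: combine the per-term distributions of all query terms."""
    result = {0.0: 1.0}  # empty query: correlation 0 with probability 1
    for weight_dist, q in zip(representative, query):
        if q != 0.0:  # a term absent from the query contributes nothing
            result = convolve(result, term_distribution(weight_dist, q, w), w)
    return result

representative = [{-1.0: 0.5, 1.0: 0.5}, {0.0: 0.25, 1.0: 0.75}]
expected = generalized_c_function(representative, [0.5, 1.0], w=0.5)
```

Multiplying the relative frequencies in `expected` by the number of documents in the cluster gives the expected number of documents at each correlation midpoint.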
3.3 Experimental Results

The above transformation was performed on the CRN4NUL [3] collection. There are 1345 terms in the documents and 467 terms in the queries in the CRN4NUL collection. Since the cost of transformation increases rapidly as the number of terms increases, only the 467 query terms are transformed. The same number of new terms is obtained. In order to avoid the high cost of computing the distribution of query-document correlations, terms whose variances (eigenvalues) are smaller than one are discarded according to the Kaiser-Guttman criterion [5]. As a result, only 178 new terms are retained.


The 100 queries in the query collection have been tested. The expected distribution and the actual distribution are produced with respect to each query as described in section 3.2. The Chi-square test (at the 5% significance level) is then performed to determine how well these two distributions match. Of the 100 queries tested, 71 satisfy the Chi-square test.
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CHAPTER 4 


CONCLUSION 


It is a common practice to organize documents into clusters in an information retrieval system. While retrieving the desired documents in such an environment, it is often desirable to predict the distribution of query-document correlations of each cluster.


The C-function is found to be suitable for a small collection of similar documents, since terms are likely to be independent in such a collection. Experimental results demonstrate that, for a small collection of documents (for example, the ADIABTH and ADINUL collections with size = 82), the C-function is a good approximation to the actual distribution. It is generally better than the Poisson and the normal distributions for these small collections. The normal distribution would be a good prediction in these small collections if the $p_i$'s are neither too small nor too large. However, in such a situation, the C-function predicts the actual distribution as well as the normal distribution does, since the C-function approaches the normal distribution for such values of the $p_i$'s. The Poisson distribution is not suitable in this situation. In summary, as a predicting distribution of document-query correlations, the C-function is the most appropriate of these three distributions for small collections.


Since a large collection of documents is usually not homogeneous, the probability that a document has a particular term is small. The distribution of correlations in such a collection is likely to be skewed to the ending part (the part with large correlations), since most of the documents have small correlations with the query. Therefore, a symmetric normal distribution should not be used to predict the distribution of correlations for a large collection of documents. Experimental results demonstrate that the normal distribution has very poor accuracy in estimating the distribution of correlations for large collections of documents (size ≥ 200). In contrast, the Poisson distribution is more suitable for such a situation. Experimental results have shown that the Poisson distribution gives a much better prediction than the normal distribution and the C-function in large collections.


Experimental results show that the modified C-function that takes into consideration the pair-wise dependence of terms gives better prediction than the unmodified C-function. This fact suggests that the terms in the data collection being tested are not totally independent. In order to obtain even better prediction, a more complex model (for example, a model that takes into account the dependencies between more than two terms) is required. Another approach is to modify the data collection such that terms are more or less independent. No method of removing the dependencies between terms is presently available unless the original terms have a multivariate normal distribution. However, it is possible to transform a set of dependent terms to a set of uncorrelated terms. The expected distribution as applied to the newly transformed data is found to be a much better approximation.


Experimental results indicate that terms are more likely to be dependent in a larger collection than in a smaller collection. Therefore, another alternative for handling a large collection is to classify the collection into small clusters.
|  dist.   |         | C-func.   | C-func.  |         |         |
|     %    | C-func. | with      | with a   | Poisson | Normal  |
| of good  |         | dep. term | classi-  | dist.   | dist.   |
|   fits   |         | pairs     | fied     |         |         |
| col-     |         | being     | collect- |         |         |
| lection  |         | combined  | ion      |         |         |
+----------+---------+-----------+----------+---------+---------+
Table 1. Results of the Chi-square test of goodness of fit at the 5% significance level and (k-1) degrees of freedom (k = no. of cells).

ND: no. of documents in the collection
NQ: no. of queries (= no. of tests) in the collection
*: no. of good fits
**: one of the tests fails due to degree of freedom < 1
+-------+--------+--------+------------+-----------+--------------+
| Query | No. of | No. of | Average    | No. of    | Average      |
| no.   | terms  | term   | Chi-square | dep. term | Chi-square   |
|       |        | pairs  | quantity   | pairs     | quantity of  |
|       |        |        |            |           | dep. term    |
|       |        |        |            |           | pairs        |
+-------+--------+--------+------------+-----------+--------------+
| ?     | 4      | 6      | 6.0        | ?         | ?            |
| ?     | 10     | 45     | ?          | 11        | ?            |
| ?     | 6      | 15     | 60.0       | 6         | 148.0        |
| ?     | 10     | 45     | 711.0      | 17        | ?            |
+-------+--------+--------+------------+-----------+--------------+
Table 2. Sample results (from the CRN14TH collection) of the 2 by 2 contingency table test for independence of term pairs of queries at the 5% significance level and 1 degree of freedom.


The critical value is 3.8. 
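The test for a single term pair can be sketched as follows, assuming the standard Chi-square statistic for a 2 by 2 contingency table without continuity correction (whether a correction was applied in the experiments is not stated). The function names and the argument layout are hypothetical; the default critical value is the one quoted above.

```python
def chi_square_2x2(n11, n10, n01, n00):
    """Chi-square statistic of a 2 by 2 contingency table for a term pair.

    n11: documents containing both terms, n10: only the first term,
    n01: only the second term, n00: neither term.
    """
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = row1 * row0 * col1 * col0
    if denom == 0:
        raise ValueError("a term occurs in all or in none of the documents")
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def is_dependent(n11, n10, n01, n00, critical=3.8):
    """True if independence of the two terms is rejected."""
    return chi_square_2x2(n11, n10, n01, n00) > critical

statistic = chi_square_2x2(20, 5, 5, 20)
```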
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| size {| { | | I 
Table 3. Chi-square test of goodness of fit on samples of the CRN14TH collection.


No. of queries (= no. of tests) = 225. 
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